Real Estate Dataset - Exploratory and Descriptive Analysis

Natacha Iradukunda and Gemima Grace Wishavura

Junior Data Analysts

2025-06-28

Introduction

In this notebook, we carry out an in-depth exploratory and descriptive analysis of the Real Estate Dataset, a collection of property sales records described by assessed value, sales ratio, and sale amount attributes.

This phase of analysis is essential for uncovering patterns, detecting potential biases, and gaining intuition about the dataset’s structure before applying any modelling procedures. We examine the distribution of key numerical and categorical variables, investigate relationships between assessed value and sale amount, and use visualizations to summarize insights.

Objectives

  • Understand property value distribution by analyzing the spread of sale amounts and assessed values across the dataset.

  • Identify dominant residential property types and examine how frequently each type appears in the market.

  • Evaluate location-based performance by comparing sales activity and average prices across top towns.

  • Analyze trends over time to observe how sale amounts and assessed values have changed year by year.

Dataset Overview

This dataset contains information on residential property sales, including details such as:

  • Sale Amounts and Assessed Values

  • Sales Ratio (sale price compared to assessed value)

  • Residential Property Types (e.g., single-family, condo, multi-family)

  • Towns where properties are located

  • List Years when properties were recorded or sold

The data provides a foundation for analyzing market value, property types, location trends, and yearly changes in the real estate sector.
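A quick structural check of these fields can be sketched as follows. The rows are invented for illustration, and the column names `Sale Amount`, `Assessed Value`, `Town`, and `List Year` are assumptions; only `Sales Ratio` and `Residential Type` are confirmed by the notebook's own code.

```python
import pandas as pd

# Hypothetical rows; every value here is made up for illustration only.
sample = pd.DataFrame({
    "List Year": [2018, 2018, 2019, 2019],
    "Town": ["Bridgeport", "Hamden", "Bridgeport", "Hamden"],
    "Residential Type": ["Single", "Condo", "Single", "Duplex"],
    "Assessed Value": [110_000.0, 95_000.0, 140_000.0, None],
    "Sale Amount": [170_000.0, 150_000.0, 210_000.0, 260_000.0],
})

# Column types and missing-value counts, side by side
overview = pd.DataFrame({
    "dtype": sample.dtypes.astype(str),
    "missing": sample.isna().sum(),
})
print(overview)

# Summary statistics for the two monetary columns
numeric_summary = sample[["Assessed Value", "Sale Amount"]].describe()
print(numeric_summary)
```

Running the same two steps on the full table would show, before any plotting, which columns need cleaning and how wide the value ranges are.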

Key Insights from Visualisations

Sale Prices Are Right-Skewed: Most properties sold for between $150,000 and $200,000, with fewer high-end transactions. Because of these high-value outliers, the median sale price is more representative than the mean.

Sales Ratio Mostly Below 1.0: The majority of properties sold for less than their assessed value, with the distribution peaking at a sales ratio of about 0.65. This may indicate consistent over-assessment or buyer negotiation power.

Single-Family Homes Dominate: Over 70% of listings are single-residence homes. This strong dominance suggests a market centered on individual homeownership rather than multi-family housing.

Bridgeport Leads in Activity: Bridgeport had the highest number of property transactions (~6,000), far ahead of other towns, indicating an active, high-turnover, and potentially affordable housing market.

Post-2018 Market Acceleration: Both assessed values and sale amounts increased sharply from 2018 to 2019, showing a strong upward trend in market activity and valuation.
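The first insight, that the median describes a right-skewed price distribution better than the mean, can be illustrated with a small sketch. The sale amounts below are hypothetical values chosen to mimic the shape described above; they are not taken from the dataset.

```python
import pandas as pd

# Hypothetical sale amounts: most near $150k-$200k, plus two high-end sales.
sale_amount = pd.Series(
    [150_000, 160_000, 165_000, 175_000, 180_000, 190_000, 200_000,
     650_000, 1_200_000],
    name="Sale Amount",
)

mean_price = sale_amount.mean()
median_price = sale_amount.median()
skewness = sale_amount.skew()  # positive values indicate a long right tail

print(f"mean = {mean_price:,.0f}, median = {median_price:,.0f}, "
      f"skew = {skewness:.2f}")
```

In this toy sample the two expensive sales pull the mean well above the median, while the median stays inside the range where most sales occur.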

Number of Properties By Sales Amount

Number of Properties by Residential Type

Number of Properties By Sales Ratio

Test

We want to determine whether the average Sales Ratio differs significantly across residential types such as single-residence, apartments, and duplex-residence.

import pandas as pd
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt

# Filter the relevant columns (real_estate is the DataFrame loaded earlier in the notebook)
data = real_estate[['Sales Ratio', 'Residential Type']].dropna()

# Group into lists per category
groups = [group['Sales Ratio'].values for name, group in data.groupby('Residential Type')]

# Visualize distributions
plt.figure(figsize=(10, 5))
sns.boxplot(x='Residential Type', y='Sales Ratio', data=data, palette='pastel')
plt.title('Sales Ratio Distribution by Residential Type')
plt.xticks(rotation=15)
plt.tight_layout()
plt.show()

# Check normality (Shapiro test for one group example)
normality_test = stats.shapiro(groups[0])  # just one group
print(f"Shapiro-Wilk test on one group: p-value = {normality_test.pvalue:.4f}")

# Levene's Test for equal variances
levene_test = stats.levene(*groups)
print(f"Levene’s Test for Equal Variances: p-value = {levene_test.pvalue:.4f}")

# One-way ANOVA
anova_result = stats.f_oneway(*groups)
print("\nOne-Way ANOVA Result:")
print(f"F-statistic = {anova_result.statistic:.4f}, p-value = {anova_result.pvalue:.4f}")

# Kruskal-Wallis Test (non-parametric)
kruskal_result = stats.kruskal(*groups)
print("\nKruskal-Wallis H-test Result:")
print(f"H-statistic = {kruskal_result.statistic:.4f}, p-value = {kruskal_result.pvalue:.4f}")

Shapiro-Wilk test on one group: p-value = 0.0000
Levene’s Test for Equal Variances: p-value = 0.0000

One-Way ANOVA Result:
F-statistic = 217.6152, p-value = 0.0000

Kruskal-Wallis H-test Result:
H-statistic = 973.9977, p-value = 0.0000
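The p-values above print as 0.0000 because of the four-decimal format, not because they are exactly zero. A minimal sketch of the difference, using two hypothetical groups of sales ratios (not the notebook's data):

```python
from scipy import stats

# Two hypothetical groups with clearly different centers.
group_a = [0.60, 0.62, 0.64, 0.65, 0.66, 0.67, 0.68, 0.70]
group_b = [0.85, 0.88, 0.90, 0.91, 0.93, 0.95, 0.96, 0.99]

result = stats.f_oneway(group_a, group_b)

fixed = f"{result.pvalue:.4f}"       # fixed-point, as in the notebook output
scientific = f"{result.pvalue:.2e}"  # scientific notation keeps the magnitude
print(f"fixed: {fixed}, scientific: {scientific}")
```

Reporting very small p-values in scientific notation, or as "< 0.0001", avoids suggesting that the probability is literally zero.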

Interpretation

Together, these tests provide strong evidence that Sales Ratios vary significantly by Residential Type:

The ANOVA and Kruskal-Wallis tests both give p-values far below 0.05 (each prints as 0.0000 at four decimal places, i.e., below 0.00005), indicating that the differences between groups are statistically significant.

The Shapiro-Wilk test, run on one group only, rejects normality, and Levene's test rejects equal variances. ANOVA's assumptions are therefore doubtful, so Kruskal-Wallis is the more reliable test here, and it agrees with ANOVA.

This supports what the box plot shows visually (e.g., non-overlapping medians, wide IQRs, distinct group behaviors).
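Statistical significance does not say how large the differences are, and with samples in the thousands even small differences can produce very small p-values. One common companion measure for Kruskal-Wallis is the epsilon-squared effect size, H / (n - 1). The notebook does not compute it; the sketch below adds it on hypothetical groups, and the group labels and values are assumptions.

```python
from scipy import stats

# Hypothetical sales ratios per residential type (illustrative values only).
groups = {
    "Single": [0.62, 0.64, 0.65, 0.66, 0.68, 0.70],
    "Duplex": [0.60, 0.63, 0.67, 0.69, 0.71, 0.74],
    "Quadplex": [0.45, 0.50, 0.55, 0.58, 0.72, 0.80],
}

h_stat, p_value = stats.kruskal(*groups.values())

# Epsilon-squared: share of rank variability associated with group membership.
n_total = sum(len(values) for values in groups.values())
epsilon_sq = h_stat / (n_total - 1)

print(f"H = {h_stat:.3f}, p = {p_value:.3g}, epsilon^2 = {epsilon_sq:.3f}")
```

Values near 0 mean group membership explains little of the variation in ranks, and values near 1 mean it explains most of it, which helps judge whether a significant result is also practically meaningful.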

Recommendation

Tailor pricing and investment strategies by residential type.

Since sales ratios vary significantly between property types (e.g., single-family, duplex, triplex), investors, assessors, and policy makers should avoid using a one-size-fits-all valuation or pricing model.

Instead, they should:

  • Apply residential-type-specific benchmarks when evaluating sale performance.

  • Pay extra attention to types with high variance or low medians (e.g., quadplexes), as these may indicate inconsistent returns or pricing inefficiencies.

  • Consider deeper location-based segmentation (e.g., average sales ratio by town and type) to refine insights further.
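The segmentation by town and type suggested above could be sketched with a pivot table. The rows below are hypothetical, the `Town` column name is an assumption, and the median is used here instead of the mean because the ratios are not normally distributed.

```python
import pandas as pd

# Hypothetical records; values are illustrative, not from the dataset.
sales = pd.DataFrame({
    "Town": ["Bridgeport", "Bridgeport", "Bridgeport",
             "Hamden", "Hamden", "Hamden"],
    "Residential Type": ["Single", "Duplex", "Single",
                         "Single", "Duplex", "Duplex"],
    "Sales Ratio": [0.60, 0.70, 0.64, 0.66, 0.58, 0.62],
})

# One row per town, one column per residential type, median ratio in each cell
ratio_table = sales.pivot_table(
    index="Town",
    columns="Residential Type",
    values="Sales Ratio",
    aggfunc="median",
)
print(ratio_table)
```

On the real data, cells backed by only a handful of sales would need care, so a matching table of counts (`aggfunc="count"`) is a sensible companion.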

Sales by Top 20 Towns

Average Sales Ratio By Town

Average Sale Amount By List Year

Average Assessed Value By List Year

Residential Type Distribution

Recommendations

Based on our exploratory and statistical analysis of the real estate dataset, we offer the following key recommendations:

Review Assessment Accuracy: Since most properties are sold for less than their assessed values (with Sales Ratios centered around 0.65), valuation practices may need recalibration to better reflect actual market prices.

Prioritize High-Demand Residential Types: Single-residence and duplex properties generally achieve higher sales ratios. Investors and planners should focus more on these categories for better returns and housing alignment.

Capitalize on High-Activity Towns: Towns like Bridgeport and Hamden show high transaction volumes, indicating vibrant markets. These areas offer strong potential for targeted investment, development, or policy focus.

Track Market Shifts Over Time: The post-2018 surge in both sale amounts and assessed values shows how property markets evolve. Year-over-year tracking is essential to anticipate price movements and demand trends.
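Year-over-year tracking, as recommended in the last point, can be sketched by averaging per list year and taking the percentage change. The figures are hypothetical, and the column names `List Year`, `Sale Amount`, and `Assessed Value` are assumptions about the dataset's schema.

```python
import pandas as pd

# Hypothetical records; values are illustrative, not from the dataset.
sales = pd.DataFrame({
    "List Year": [2017, 2017, 2018, 2018, 2019, 2019],
    "Sale Amount": [180_000, 200_000, 190_000, 210_000, 240_000, 260_000],
    "Assessed Value": [120_000, 130_000, 125_000, 135_000, 150_000, 170_000],
})

# Average per list year, then percentage change from the previous year
yearly = sales.groupby("List Year")[["Sale Amount", "Assessed Value"]].mean()
yoy_pct = yearly.pct_change() * 100

print(yearly.join(yoy_pct.add_suffix(" YoY %")).round(1))
```

The first year has no previous year to compare against, so its change is missing (NaN) by construction.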

Conclusion

This analysis of the real estate dataset has uncovered key insights into property valuation trends, sales patterns, and market behavior.

Most properties tend to sell below their assessed values, with significant variation across residential types and locations.

Single-family homes dominate the market, and towns like Bridgeport and Hamden emerge as hotspots for real estate activity.

The sharp increase in both assessed and sale values after 2018 suggests a market rebound or adjustment period.

These findings provide a strong foundation for data-driven decisions in valuation, investment, and urban planning.

Thank You!